Reminder: some code is hidden from the HTML report.
To see the full code, please see “FinalReport.Rmd”.
If you want to run the whole code, remember to set every chunk’s eval option to TRUE.
Some analyses are already included in the starter code “HappyDB_RShiny.Rmd”, which is also in the /doc folder.
This report does not repeat those existing results.
This project aims to analyze what makes people happy, using the data collected in HappyDB.
HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon’s Mechanical Turk. The survey question is:
What made you happy today? Reflect on the past {24 hours|3 months}, and recall three actual events that happened to you that made you happy. Write down your happy moment in a complete sentence.
Write three such moments.
Examples of happy moments we are NOT looking for (e.g events in distant past, partial sentence):
- The day I married my spouse
- My Dog
You can read more about it at https://arxiv.org/abs/1801.07746
The text_processing.Rmd (provided by ) cleans the text by converting all letters to lower case; removing punctuation, numbers, empty words, extra white space, and stop words; stemming words; and keeping the highest-frequency words for each moment.
However, the original text_processing.Rmd has a weakness: it does not reliably reduce different verb tenses to their base form.
library(readr)  # read_csv(), write_csv()
library(knitr)  # purl()

# Load data (download and cache it if it does not exist locally)
if (!file.exists("../data/cleaned_hm.csv")) {
  urlfile <- "https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv"
  # Save the download so that the later read from ../data succeeds
  write_csv(read_csv(urlfile), "../data/cleaned_hm.csv")
}
# Process the data if processed_moments.csv does not exist
if (!file.exists("../output/processed_moments.csv")) {
  hm_data <- read_csv("../data/cleaned_hm.csv")
  # Extract the R code from the Rmd into a temporary file and run it
  source(purl("../lib/Text_Processing.Rmd", output = tempfile()))
}
if (!file.exists("../data/demographic.csv")) {
  urlfile <- "https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv"
  write_csv(read_csv(urlfile), "../data/demographic.csv")
}
demo_data <- read_csv("../data/demographic.csv")
hm_data <- read_csv("../output/processed_moments.csv")
We also combine the happy moments data with the workers’ demographic data.
To keep the analysis focused, we filter out some special cases, such as workers whose gender is “other” or NA.
[Code hidden]
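The hidden chunk is not reproduced here. As a minimal sketch of the idea, assuming the two tables share a worker id column named `wid` and that gender is coded as `m`, `f`, `o` (both are assumptions, and the toy data frames below are made up for illustration), the join and filter might look like this:

```r
# Toy stand-ins for the two tables; the column names wid and gender are assumptions.
hm_toy <- data.frame(
  hmid = 1:5,
  wid  = c(1, 2, 2, 3, 4),
  text = c("walk dog", "eat cake", "see friend", "win game", "read book"),
  stringsAsFactors = FALSE
)
demo_toy <- data.frame(
  wid    = 1:4,
  gender = c("m", "f", "o", NA),
  stringsAsFactors = FALSE
)

# Inner join on the worker id, so each moment carries its author's demographics.
joined <- merge(hm_toy, demo_toy, by = "wid")

# Keep only rows whose gender is one of the two groups under study;
# %in% returns FALSE for NA, so missing genders are dropped as well.
filtered <- joined[joined$gender %in% c("m", "f"), ]
print(filtered)
```

Using `%in%` rather than `==` avoids rows of `NA` appearing in the result when gender is missing.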
We can see that the demographic data include each worker’s gender, marital status, parenthood, age, country, etc. The original happy moments data contain one sentence per moment.
For a detailed description, see HappyDB.
We can perform a statistical analysis based on the number of words.
summary(hm_data$originalcount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 9.00 14.00 18.31 21.00 1155.00
Strange: how can a single moment be as long as 1155 words? Let’s look at the long moments with more than 1000 words.
## [1] "This is the second essay in a two-part series about my journey to visit the Taj Mahal. Read the first part for the whole story.\r\n\r\nA\r\nGRA, Uttar Pradesh, India a As soon as my train to Agra leaves fro"
## [2] "This is the second essay in a two-part series about my journey to visit the Taj Mahal. Read the first part for the whole story.\r\n\r\nA\r\nGRA, Uttar Pradesh, India -- As soon as my train to Agra leaves fr"
## [3] "My much awaited desired prolonged Velankanni trip.\r\n\r\nOne of the greatest pleasures of being in Chennai is your proximity to a lot of one day getawaysa| and nothing more exciting than planning a trip."
Oh, I see: these workers simply ignored the task requirements and wrote whole essays about their happy moments! We exclude them from the data, dropping every moment of 500 words or more, so that they do not carry 100 times the weight of a normal moment.
Note that you can hover the cursor over any plot in this report to see the exact numbers.
hm_data=hm_data[hm_data$originalcount<500,]
summary(hm_data$originalcount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 9.00 14.00 18.16 21.00 483.00
hist(hm_data$originalcount[hm_data$originalcount<100],main="Happy Moments' Sentence Length",xlab = "Length")
Most of the happy moments are short sentences (<20 words), as expected!
Some happy moments even have only two words.
We tried to find out whether demographic differences influence happiness.
Intuitively, we believe that demographics may influence emotion. For example, young people may be happier than older people because they have more time to play, and Japanese men may suffer from high pressure…
The first, naive approach is to compare word frequencies between the happy moments of different demographic groups.
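To make the frequency comparison concrete, here is a minimal sketch in base R. The toy data frame, the `gender` grouping, and all variable names are made up for illustration and are not the report's actual code.

```r
# Toy data, made up for illustration: processed text plus one demographic attribute.
toy <- data.frame(
  text   = c("friend dinner", "game friend", "dinner family", "game win game"),
  gender = c("f", "f", "m", "m"),
  stringsAsFactors = FALSE
)

# Split each group's moments into words.
words_by_group <- lapply(split(toy$text, toy$gender),
                         function(x) unlist(strsplit(x, " ", fixed = TRUE)))

# Count every vocabulary word in every group (rows = words, columns = groups).
vocab  <- sort(unique(unlist(words_by_group)))
counts <- sapply(words_by_group,
                 function(w) as.vector(table(factor(w, levels = vocab))))
rownames(counts) <- vocab

# Relative frequency within each group, then the difference between groups.
rel_freq  <- sweep(counts, 2, colSums(counts), "/")
freq_diff <- rel_freq[, "f"] - rel_freq[, "m"]
print(sort(freq_diff, decreasing = TRUE))
```

Dividing by each group's total word count keeps a larger group from dominating the comparison just because it wrote more.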
sum(unlist(strsplit(hm_data$text,split = " "))=="spring")
## [1] 372
sum(unlist(strsplit(hm_data$text,split = " "))=="summer")
## [1] 720
sum(unlist(strsplit(hm_data$text,split = " "))=="winter")
## [1] 115
sum(unlist(strsplit(hm_data$text,split = " "))=="autumn")
## [1] 0
We tried to find “Autumn”, but “Autumn” is never mentioned in the moments, so we use “Fall” instead.
sum(unlist(strsplit(hm_data$text,split = " "))=="fall")
## [1] 246
However, “fall” may be the verb “fall” rather than the season “Fall”, so this count is an upper bound.
In conclusion, summer has the most occurrences in happy moments. Thus, summer may be the biggest source of happiness among the seasons.
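The season counts above tokenize the whole corpus once per word. A sketch that tokenizes once and counts all the words of interest together could look like the following. The `text` vector is a made-up stand-in for `hm_data$text`.

```r
# Toy stand-in for hm_data$text (made-up sentences, already lower-cased).
text <- c("enjoy summer vacation", "first day spring", "summer trip beach",
          "leaf fall tree", "winter snow")

# Tokenize once, then count all words of interest in a single pass.
tokens  <- unlist(strsplit(text, split = " ", fixed = TRUE))
seasons <- c("spring", "summer", "autumn", "fall", "winter")

# Using a factor with fixed levels keeps words with zero matches in the table.
season_counts <- table(factor(tokens[tokens %in% seasons], levels = seasons))
print(season_counts)
```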
As we would like to identify interesting words for each group, we use TF-IDF to weigh each term within each group’s moments.
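As a rough illustration of the weighting, the sketch below computes one common TF-IDF variant in base R: term frequency as a share of the group's words, multiplied by the natural log of the number of groups divided by the number of groups containing the term. The count matrix is made up, and the report's own computation may use a different variant or a package.

```r
# Toy term counts per group (rows = terms, columns = groups); numbers are made up.
counts <- matrix(
  c(4, 3, 5,   # "friend" appears in every group
    3, 0, 0,   # "game" appears only in group a
    0, 2, 0),  # "baby" appears only in group b
  nrow = 3, byrow = TRUE,
  dimnames = list(c("friend", "game", "baby"), c("a", "b", "c"))
)

# Term frequency: share of each group's words taken up by the term.
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: log(number of groups / groups containing the term).
idf <- log(ncol(counts) / rowSums(counts > 0))

# TF-IDF: idf has one entry per row, so recycling scales each term's row by its idf.
tf_idf <- tf * idf
print(round(tf_idf, 3))
```

A term that appears in every group, such as "friend" here, gets a weight of zero, so only words that distinguish a group stand out.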